In [1]:
%matplotlib inline
import pandas as pd
import seaborn as sns
from stemgraphic import stem_graphic
In [2]:
texas = pd.read_csv('salaries.csv')
In [3]:
texas.describe(include='all')
Out[3]:
In [4]:
%time ax = texas.Annual_salary.hist()
Let's try with seaborn's distplot.
In [5]:
%time g = sns.distplot(texas.Annual_salary)
Ah yes. We have to do some data munging before distplot will accept the column: removing the NaN (not a number, i.e. null) values. Those extra steps should also be taken into account when comparing performance.
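The munging itself is a one-liner. A minimal sketch, using a made-up series standing in for Annual_salary (the values here are invented, not the Texas data):

```python
import numpy as np
import pandas as pd

# hypothetical salary column with missing entries
s = pd.Series([52000.0, np.nan, 61000.0, np.nan, 48000.0])

n_missing = s.isna().sum()  # how many NaN values we are dropping
clean = s.dropna()          # what actually gets passed to the plotting call

print(n_missing, len(clean))
```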
In [6]:
%time g = sns.distplot(texas.Annual_salary.dropna())
A little slower and nicer looking, but again not as informative as we'd like. Let's see how stem_graphic does.
In [7]:
%time fig, ax = stem_graphic(texas.Annual_salary, display=500, random_state=1235)
We can see a lot of detail. The extremes ARE extreme ($5.2M). I can also see the trend for non-managerial, managerial, upper management and beyond ($350K). And I can look at this for much larger sets of data.
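stem_graphic builds the figure for us, but the underlying idea is simple: split each value into a stem (the leading digits) and a leaf (the next digit), then group leaves by stem. A hand-rolled sketch on toy salary figures in thousands (invented numbers, not the Texas data):

```python
from collections import defaultdict

import pandas as pd

# toy salaries, in thousands
salaries = pd.Series([31, 35, 38, 42, 42, 47, 51, 56, 63])

rows = defaultdict(list)
for v in salaries:
    stem, leaf = divmod(int(v), 10)  # tens digit is the stem, units digit the leaf
    rows[stem].append(leaf)

for stem in sorted(rows):
    print(f"{stem} | {' '.join(str(leaf) for leaf in sorted(rows[stem]))}")
# prints:
# 3 | 1 5 8
# 4 | 2 2 7
# 5 | 1 6
# 6 | 3
```

Because every leaf is one digit, the length of each row doubles as a bar chart, which is why a stem-and-leaf plot shows both the distribution and the raw values at once.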
In [8]:
!head yellow_tripdata_2015-01.csv
In [9]:
%time !wc yellow_tripdata_2015-01.csv
In [10]:
%time df1 = pd.read_csv('yellow_tripdata_2015-01.csv')
From the above results, we see that wc, an optimized word-count utility written in C, took close to 17 seconds to count the lines, words and characters in the file. wc serves as a reference low-water mark: it is expected to be faster than anything in Python loading that same document.
And sure enough, pandas took longer, at around 29 seconds, to load it.
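If load time matters more than completeness, pandas can usually read the same file faster by parsing only the columns you need, with explicit dtypes. A sketch on a small inline stand-in for the trip file (the pickup_datetime and passenger_count header names here are assumptions for illustration, not necessarily the file's real headers):

```python
import io

import pandas as pd

# tiny stand-in for yellow_tripdata_2015-01.csv
csv = io.StringIO(
    "pickup_datetime,passenger_count,total_amount\n"
    "2015-01-01 00:11:33,1,17.05\n"
    "2015-01-01 00:13:00,2,9.80\n"
)

# parsing only the column we plot, with an explicit dtype, skips the
# work of converting the other columns on the full 12-million-row file
df = pd.read_csv(csv, usecols=["total_amount"], dtype={"total_amount": "float64"})
print(len(df), round(df.total_amount.sum(), 2))
```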
In [11]:
df1.head()
Out[11]:
In [12]:
%%time
fig, ax = stem_graphic(df1.total_amount, display=500);
Loading the data (29 s) and displaying it (under 1.5 s) took a little over 30 seconds in total. Anything under 31 seconds is quite acceptable for 12 million rows.
But we can do slightly better with dask (although this scenario, a single local file on a laptop with only four cores and limited memory and I/O bandwidth, doesn't fully demonstrate the real power of dask compared to a server, a cluster, and/or multiple files).
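The core of dask.dataframe's advantage is out-of-core, partitioned processing: the file is read and reduced piece by piece instead of being materialized whole. The streaming half of that idea can be sketched with pandas alone via chunksize, again on an inline stand-in for the real file:

```python
import io

import pandas as pd

# tiny stand-in; with the real file you would pass the filename instead
csv = io.StringIO("total_amount\n17.05\n9.80\n5.00\n12.50\n")

# process fixed-size chunks so the whole file never has to fit in memory;
# dask.dataframe applies the same idea across partitions, in parallel
total = 0.0
rows = 0
for chunk in pd.read_csv(csv, chunksize=2):
    total += chunk.total_amount.sum()
    rows += len(chunk)

print(rows, round(total, 2))
```

What dask adds on top of this loop is a scheduler that runs the per-chunk work on multiple cores (or machines) and a DataFrame API so you don't have to write the reduction by hand.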
In [13]:
# you need to have the dask module installed to run this part
import dask.multiprocessing
import dask.dataframe as dd

# route dask work through the multiprocessing scheduler
# (on dask >= 0.18 the equivalent call is dask.config.set(scheduler='processes'))
dask.set_options(get=dask.multiprocessing.get)
In [14]:
%%time
df = dd.read_csv('yellow_tripdata_2015-01.csv')
stem_graphic(df.total_amount, display=500)